ABSTRACT

An analysis is performed of the memory bandwidth that results in a
system consisting of multiple processors and a shared, interleaved memory.
Special emphasis is placed on studying and modeling the memory referencing
behavior of the programs that execute on the processors and on ascertaining
the effect that this behavior has on the memory bandwidth. Some commonly used
models of program behavior are shown to be unrealistic and, instead, a new, more
sophisticated model is proposed and validated by means of measurements on four
trace tapes. Based on this model of program behavior, a simple structural
model of the system is analyzed. This is effected by extending a technique
previously developed for use with a simpler model of program behavior and by
advancing a new technique based on the diffusion approximation and probability
generating function methods. The latter analysis is shown, via simulations,
to be very accurate. Finally, it is demonstrated that program behavior, in
this context, may be captured by a simple, two-parameter model.
1.  Introduction

In  view  of  the  fact  that  a large  fraction  of  the  instructions 
executed  in  any  processor  require  operands  from  memory,  the  memory  access 
time  is  a fundamental  determinant  of  overall  performance.  The  average  memory 
access  time  is  directly  affected  by  the  technology  used  to  construct  the 
memory.  When  performance  requires  a shorter  memory  access  time,  one  may  use 
either  a faster  technology  for  the  entire  memory  or  a memory  hierarchy.  The 
latter  alternative  is,  generally,  the  more  cost-effective  and  is  widely  used 
in  many  medium  and  high  speed  computers  [1-5].  The  effectiveness  of  such  a 
strategy has been demonstrated in theory and in practice by a number of workers
[6-11]. 

A second  factor  that  contributes  to  the  memory  access  time  is  the 
degradation  that  is  experienced  when  a number  of  requests  vie  with  one  another 
for  the  use  of  the  memory.  The  solution  that  is  employed  is  to  partition  the 
memory  into  a number  of  distinct  physical  resources  called  modules,  each  of 
which  may  service  a request  in  parallel  with  the  other  modules.  The  sets  of 
memory  locations  contained  in  the  various  modules  are  disjoint  and  a request 
can,  therefore,  be  served  only  by  a specific  module.  Depending  on  the  pattern 
of  requests,  it  is  possible  for  a number  of  requests  to  be  conflicting  for  a 
subset of the modules, while the remaining modules lie idle, with a consequent
increase  in  the  average  access  time  and  a reduction  in  the  memory  bandwidth, 
i.e.,  the  mean  rate  at  which  requests  are  served.  The  objective  of  this  paper 
is  to  study  this  phenomenon. 

The  use  of  a cache  memory  reduces  the  memory  interference  problem 
at  main  memory  substantially  since  the  block-fetches  to  the  cache  exercise 
main  memory  in  an  optimal  fashion  - sequentially.  However,  in  a multiprocessor 
environment, interference between requests takes place at the cache if it is
common  to  all  the  processors.  An  alternative  is  to  provide  each  processor  with 
its  own  cache.  While  this  does  eliminate  the  interference, it  introduces  the 
problem  of  updating  multiple  copies  of  the  same  item.  Every  store  operation 
must  be  cross-checked  with  every  cache  and  since  stores  account  for  approxi- 
mately a fifth  of  all  memory  references,  the  interference  problem  is  merely 
reduced,  not  eliminated.  Consequently,  the  analysis  of  interleaved  memories 
remains  a worthwhile  and  relevant  study. 

The  bandwidth  obtained  from  the  memory  depends  on  a number  of  factors 
relating  to  the  memory  sub-system,  the  processor  sub-system  and  the  processor- 
memory  interconnection.  Consequently,  the  structural  model  of  the  system  should, 
ideally,  incorporate  all  these  factors. 

1.1.  The  processor  model 

The  processor  sub-system  may  be  parameterized  as  follows: 

1.  the  number  of  processors,  N, 

2.  the  maximum  request  issuance  rate, 

3.  the  number  of  pending  requests  that  the  processor  is  able  to  buffer, 

4.  the  interval  between  successive  requests  from  a processor,  etc. 

The  tendency  appears  to  have  been  to  study  either  a uniprocessor  which  is  able 
to  generate  an  unlimited  number  of  pending  requests  [12-19]  or  a multiprocessor 
in  which  each  processor  can  have  only  one  outstanding  request  [20-27].  Most 
studies  have  assumed  that  each  processor  generates  a request  every  cycle  unless 
blocked  but  there  have  been  some  exceptions;  Flores  assumes  that  the  request 
arrivals  form  a Poisson  process  [12],  Baskett  and  Smith  allow  for  a processor 
to  wait  a geometrically  distributed  time  between  having  a request  serviced  and 
generating  the  next  request  [24].  This  "think"  time  has  also  been  assumed  to 
be of constant length [21, 23] or distributed arbitrarily with a given mean
[25]. 

1.2.  The  memory  model 

The memory sub-system is described by the following parameters:

1.  the  number  of  memory  modules,  M, 

2.  the  width  of  each  module,  i.e.,  the  unit  of  access  of  information 
from  the  module, 

3.  the  type  of  interleaving  - high-order  or  low-order, 

4.  the  extent  and  type  of  buffers  for  queueing  of  requests, 

5.  the  access  time  of  the  memory, 

6.  the  cycle  time  of  the  memory, 

7.  whether  the  modules  operate  synchronously. 

By  and  large,  the  width  of  the  module  has  been  assumed  to  be  the  same  as  the 
unit  of  request,  but  Coffman,  Burnett  and  Snowdon  study  the  case  where  the 
width  of  the  module  is  a multiple  of  the  unit  of  request  by  the  processor  [15]. 
High-order  interleaving,  in  which  all  locations,  whose  addresses  have  the 
same  high-order  bits,  go  into  the  same  module,  has  certain  advantages  from  the 
viewpoint  of  reliability  and  availability.  However,  low-order  interleaving 
is  employed  if  enhanced  bandwidth  is  the  objective.  Many  models  have  assumed 
that the memory sub-system cannot accommodate requests awaiting service [13,14,
17,19]  and,  accordingly,  the  processor  is  blocked  as  soon  as  a request  is  to 
a busy module. Flores assumes an infinitely large queueing buffer per module
whereas Coffman, Burnett and Snowdon consider a finite sized conflict buffer
common to the modules. Although most work has assumed that the cycle time and
access time are identical there have been some exceptions.

1.3.  The processor-memory interconnection

Although many types of interconnections may be conceived of, just a handful
have received serious attention. These are:

1.  a shared,  time-multiplexed  bus, 

2.  a crossbar  connection  such  that  every  processor  is  connected  to  every 
module  and 

3.  a combination  of  the  above  two. 

The  majority  of  uniprocessor  bandwidth  studies  have  implicitly  assumed  a time- 
multiplexed  bus  [12-19].  Of  these,  all  but  [19]  have  assumed  that  the  bus 
cycle is negligible and that all requests are made simultaneously. Most
multiprocessor studies, on the other hand, have assumed a crossbar interconnection
between  the  processors  and  the  memories.  Briggs  and  Davidson  [28],  however, 
have  investigated  a system  consisting  of  l time-multiplexed  busses  each  of 
which  is  connected  to  m memory  modules. 

1.4.  The  structural  model  under  study 

The structural model which will be analyzed in this paper is a simplification
of one presented by Skinner and Asher [20] which has been frequently used
by subsequent workers [21-24,26,27]. Specifically, the model
makes the following assumptions:

1.  the  system  contains  N statistically  identical  processors, 

2.  each  processor  may  have  only  one  outstanding  request  and  is  blocked 
while  the  request  is  awaiting  service, 

3.  when  a processor  is  serviced  it  immediately  generates  a request  the 
very  next  cycle, 

4.  the system contains M memory modules,

5.  the module width is 1 double word (8 bytes) which is also the unit
of  request, 

6.  the modules are interleaved on the low-order bits of the address that
is  obtained  after  discarding  the  lowest-order  three  bits  (since  the 
module  width  is  8 bytes), 

7.  the  memory  has  no  queueing  buffer  space;  however,  the  processors  may 
be  conceptually  viewed  as  being  queued  on  the  module  that  they  are 
attempting  to  reference, 

8.  the  access  time  and  the  cycle  time  of  the  modules  are  identical  and 
serve  as  the  unit  of  time, 

9.  the  modules  operate  synchronously  and 

10.  the  processors  and  memory  are  interconnected  by  a crossbar  switch. 

Such  a model  is  fairly  successful  as  an  abstraction  of  a system  such  as  C.mmp 
[29]. 
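
Assumptions 5 and 6 amount to a simple address-to-module mapping. The following is a minimal sketch of that mapping; the function name `module_index` and its parameters are illustrative and do not appear in the original.

```python
def module_index(byte_address: int, num_modules: int, width_bytes: int = 8) -> int:
    """Return the module serving a byte address under low-order interleaving."""
    # Discard the byte-within-word bits (3 bits when the width is 8 bytes),
    # then interleave on the low-order bits of what remains.
    word_address = byte_address // width_bytes
    return word_address % num_modules


# Consecutive double words land in consecutive modules (modulo M).
print([module_index(a, 4) for a in (0, 8, 16, 24, 32)])  # [0, 1, 2, 3, 0]
```

With this mapping a purely sequential sweep through memory visits the modules in cyclic order, which is why sequentiality in the reference string matters for bandwidth.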

An  issue  that  has  been  neglected  in  the  above  discussion  of  system 
modeling  is  the  behavior  of  the  references  generated  by  each  processor.  This 
reference  pattern  can  affect  the  bandwidth  substantially  and  is  discussed  at 
length  in  Section  2.  Section  3 reviews  various  analytical  techniques  that 
have  been  employed  in  the  past  and  attempts  to  extend  them  to  the  model  of 
reference  behavior  that  is  assumed  here.  Section  4 develops  an  exact  analysis 
which,  by  itself,  is  unable  to  provide  the  desired  results.  However,  in 
Section  5,  this  analysis,  in  conjunction  with  the  diffusion  analysis  of  Section 
3,  yields  the  expression  for  memory  bandwidth.  Finally,  in  Section  6,  a 
simplified  two-parameter  model  of  program  behavior  is  advanced  and  shown  to 
be  quite  satisfactory. 
2.  A Model of Program Behavior

A perusal  of  the  literature  on  interleaved  memories  reveals  very  few 
attempts  to  measure  and  study  the  effects  of  program  behavior  upon  interleaved 
memory  bandwidth.  In  the  area  of  uniprocessor  systems,  Burnett  and  Coffman 
[14]  modeled  the  reference  string  by  separately  modeling  the  instruction  and 
data  references  and  assuming  that  alternate  cycles  are  dedicated  to  making 
instruction or data requests. The instruction stream is described by the
parameter $\lambda$, the probability that an instruction is a successful branch
(presumably to a target which is equally likely to lie in any module). With
probability $(1-\lambda)$, the next instruction is the sequential successor of the
current one. The model for the data stream employs a parameter $\alpha$, the
probability that the next data request will be to the sequentially following
module (modulo M). With probability $(1-\alpha)/(M-1)$ the next data request is to
any one of the remaining modules. Although unable to analytically evaluate this
model in the general case, Burnett and Coffman did so via extensive
simulations. Unfortunately, they did not validate this model of program behavior
with measurements on real programs. Terman has tested the validity of this
model using trace-driven simulations [18]. The technique he used was to
measure $\lambda$ and $\alpha$ and use Burnett and Coffman's results to predict the
bandwidth. This he then compared with direct simulation measurements of
bandwidth. Terman's results indicate that whereas the model is adequate for the
instruction  stream,  it  is  ineffective  for  the  data  stream.  Furthermore,  the 
model  is  not  directly  applicable  when  the  instruction  and  data  streams  are 
merged  as  is  normally  the  case. 

The  Least-Recently-Used  Stack  Model  of  program  behavior  has  been 
used  as  a model  of  memory  referencing  behavior  in  the  uniprocessor  environment 
[19]. The advantage of this model is that it is able to capture the property
which leads to the clustering in time of references to the same module. It is
also amenable to analysis and has been shown to be quite accurate in its
predictions of observed bandwidth.

The tendency in most multiprocessor bandwidth studies has been to
assume that the sequence of requests generated by each processor can be
modeled by the Random Independent Reference Model (RIRM), in which every
request is independent of every other request and has an equal probability,
1/M, of being directed to any module. There have been two exceptions.
Hoogendoorn [25] models the reference string by the Independent Reference Model
(IRM), whereby each request from processor i has an independent and constant
probability, $p_{ij}$, of being directed to module j, where $\sum_j p_{ij} = 1$;
$p_{ij}$ need not be equal to $p_{kj}$ for $i \neq k$. He is able to perform an
approximate analysis which compares well with simulation results. Sethi and
Deo [26] have employed a localized model of referencing in which, with
probability $\alpha$, a processor references the same module that it last
referenced and, with probability $(1-\alpha)/(M-1)$, it references any one of the
remaining modules. They propose this as a model of referencing behavior in a
memory that is interleaved on the high-order bits. They do not validate their
model by measurements nor are they able to derive an expression for memory
bandwidth in the general case, although they do form conjectures based on the
exact results for small numbers of processors or memory modules.

In developing a model of program behavior, one must take care to
ensure that the model captures most of the relevant characteristics that
affect bandwidth while at the same time retaining analytic tractability. The
first step taken was to check whether the RIRM or IRM were in fact sufficient
or whether a more sophisticated model was called for. Accordingly, a series
of simulations was performed in which systems consisting of 4 processors and
M memory modules (M = 2, 4, 8, 16) were studied. Four trace tapes were used to
generate the request sequences for the four processors. These trace tapes are:

043  - Fortran  execution 

049  - Sort  execution 

050  - Cobol  compilation 

051  - Cobol  compilation

These trace tapes are part of a much larger library of tapes [30]. The memory
modules were assumed to have a width of 8 bytes (i.e., the low-order 3 bits
of each request address were discarded) and they were interleaved based on
the low-order bits of the remaining portion of the address. Since it is
possible for a number of processors to reference a module in the same cycle,
it is necessary to define a priority by which requests are entered into the
queue associated with the module. The scheme used was one with a "rotating
priority" such that in cycle j, the processors would be inspected in the order
j, j+1, j+2, j+3 (modulo 4) to see if they had any request to make, in
which case the request would be entered into the appropriate queue before
proceeding to the next processor. As a consequence, no processor gets
preferential treatment in the long run. Other than this, the simulation conformed
to all the structural assumptions listed in Section 1. Each simulation was
run for 10,000 memory cycles. Two sets of measurements were taken for each
simulation run: firstly, the long run probability of referencing a module,
averaged over all four programs, was estimated to allow for the testing of the
RIRM hypothesis and, secondly, the long run conditional probability, $q_{ij}$,
of a processor referencing module j given that module i was last referenced by
the processor was estimated, once again averaged over all four programs. These
statistics are presented in Tables 1, 2 and 3 for M equal to 2, 4 and 8
respectively. (The statistics for M = 16 were gathered but are not presented
here.)
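
The simulation procedure just described can be sketched as follows. This is a minimal illustration rather than the simulator actually used: the trace tapes are not available here, so each processor is driven by a hypothetical iterator of module indices, and the rotating priority is generalized from 4 to N processors. The name `simulate_bandwidth` and its input format are assumptions.

```python
from collections import deque
from itertools import cycle


def simulate_bandwidth(streams, num_modules, cycles):
    """Mean number of busy modules per cycle for the structural model.

    streams[p] is an iterator yielding the module referenced by each
    successive request of processor p (hypothetical input format).
    """
    num_processors = len(streams)
    queues = [deque() for _ in range(num_modules)]
    ready = list(range(num_processors))  # processors issuing a request this cycle
    busy_total = 0
    for t in range(cycles):
        # Rotating priority: in cycle t inspect processors t, t+1, ... (mod N).
        for p in sorted(ready, key=lambda proc: (proc - t) % num_processors):
            queues[next(streams[p])].append(p)
        ready = []
        # Every module with a non-empty queue serves exactly one request;
        # the served processor issues its next request in the following cycle.
        for q in queues:
            if q:
                ready.append(q.popleft())
                busy_total += 1
    return busy_total / cycles


# Four processors that reference only module 0: one request served per cycle.
print(simulate_bandwidth([cycle([0]) for _ in range(4)], 2, 100))  # 1.0
# Four purely sequential programs that all start at module 0.
print(simulate_bandwidth([cycle(range(4)) for _ in range(4)], 4, 100))  # 3.94
```

In the second example the processors collide only during the first three cycles; once staggered, each follows the others around the modules and all four modules stay busy.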

An examination of these three tables reveals that the long term
probability of referencing a module is approximately 1/M and an IRM model with
unequal reference probabilities is ruled out. Next, an examination of the
estimated first-order Markov transition matrices shows that $q_{ij} \neq q_{kj}$
for $i \neq k$. Hence, the hypothesis of independence between successive
requests must be rejected and along with it the RIRM. Finally, the localized
reference model of Sethi and Deo requires that $q_{ij} = q_{ik}$ for all
$j, k \neq i$. The measurements indicate that whereas $q_{ij}$ is approximately
equal to $q_{ik}$ for j and k not equal to i or i+1, there is a pronounced
tendency to reference either the same module again or the sequential successor
(modulo M). The localizing effect hypothesized by Sethi and Deo in a high-order
interleaved memory is in fact present in the low-order interleaved memory too
but, in addition, there is a significant measure of sequentiality in the
reference string generated by a program. The enhanced probability of
referencing the same module is explained by operations of the read-modify-write
type and by byte manipulating instructions (SS instructions) which separately
access the individual bytes in a double word (8 bytes). The observed
sequentiality may be explained by the sequentiality inherent in the instruction
stream as well as by operations on arrays. That the data stream, too, has
significant sequentiality has been demonstrated [31].
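
The two statistics used in this argument can be estimated from any sequence of module references. The sketch below is illustrative only; the function names are hypothetical and the spread measure is a crude stand-in for inspecting the tables by eye, not a test used in the original study.

```python
def estimate_reference_statistics(modules, num_modules):
    """Estimate long-run and first-order conditional reference frequencies.

    modules is the list of module indices referenced by one program.
    Returns (marginal, q) where q[i][j] estimates the probability of
    referencing module j given that module i was referenced last.
    """
    marginal = [0.0] * num_modules
    for m in modules:
        marginal[m] += 1.0 / len(modules)
    counts = [[0] * num_modules for _ in range(num_modules)]
    for prev, cur in zip(modules, modules[1:]):
        counts[prev][cur] += 1
    q = []
    for row in counts:
        total = sum(row)
        q.append([c / total if total else 0.0 for c in row])
    return marginal, q


def max_row_spread(q):
    """Largest difference between two rows of q in any single column.

    Under independent references every row of q would be the same, so a
    large spread argues against both the RIRM and the IRM.
    """
    return max(max(col) - min(col) for col in zip(*q))


# A purely sequential reference string: uniform marginals, yet far from independent.
marginal, q = estimate_reference_statistics([0, 1, 2, 3] * 5, 4)
print([round(m, 3) for m in marginal], max_row_spread(q))
```

The example shows why the marginal frequencies alone cannot validate the RIRM: a sequential program references every module equally often while its transition matrix is as far from independence as possible.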

On the basis of these measurements it is clear that the RIRM and IRM
must be rejected and that the localized reference model is inadequate. The
model  of  program  behavior  used  in  this  paper  is  a first-order  Markov  model, 
i.e.,  one  in  which  the  module  reference  probabilities  for  a program  depend  on 
the  module  that  was  last  referenced  by  that  program.  Such  a model  may  be 
expected  to  capture  both  localized  referencing  behavior  as  well  as  sequentiality 
and  contains  as  special  cases  the  IRM  and  RIRM.  It  is  possible  and,  in  fact, 
probable that this model is not adequate in the sense that higher order
dependencies do, in fact, exist between references. However, subsequent results
will  show  that  such  dependencies,  to  whatever  extent  they  exist,  do  not 
materially  affect  the  bandwidth.  Furthermore,  this  model  has  the  important 
advantage  of  being  amenable  to  analysis. 

One last observation which simplifies the analysis is that the
estimated first-order transition matrix is circularly symmetric, i.e., $q_{ij}$
is approximately equal to $q_{i+k,j+k}$ for all i, j and k (indices taken
modulo M). This, of course, is to be expected since a program is equally likely
to be placed in memory at one location as it is to be placed in another which
is displaced by k modulo M. The first-order transition matrix is fully
described by M parameters, $p_0$ through $p_{M-1}$, where $p_k$ is defined equal
to $\sum_i q_{i,i+k}/M$. Thus $p_0$ is the probability of re-referencing the
same module and $p_1$ is the probability of referencing the sequentially
following module. This model, described by the parameters $\{p_i\}$, will be
termed the Symmetric First-Order Markov Model (SFMM). The values of the
parameters averaged over all four programs are presented in Table 4 for
values of M of 2, 4, 8 and 16.
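
The relation between the estimated transition matrix and the SFMM parameters can be sketched as below. The function names are hypothetical, and the parameter values in the example are arbitrary illustrations, not the measured values of Table 4.

```python
def sfmm_parameters(q):
    """p[k] = (1/M) * sum over i of q[i][(i + k) mod M], the k-th cyclic diagonal average."""
    M = len(q)
    return [sum(q[i][(i + k) % M] for i in range(M)) / M for k in range(M)]


def sfmm_matrix(p):
    """Circulant transition matrix whose (i, j) entry is p[(j - i) mod M]."""
    M = len(p)
    return [[p[(j - i) % M] for j in range(M)] for i in range(M)]


# Illustrative values only (not taken from Table 4).
p = [0.5, 0.3, 0.1, 0.1]
Q = sfmm_matrix(p)
for row in Q:
    print(row)
```

Because the matrix is circulant, every row and every column sums to one, so the stationary probability of referencing any module is 1/M, in agreement with the measured long-run reference probabilities.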


3.  Analytical  techniques 

In this section we review a number of techniques, mostly of an approximate
nature, which have been used in analyzing the same structural model
but with the RIRM assumption for program behavior. Our interest lies in
attempting  to  extend  these  techniques  to  handle  the  SFMM. 

3.1.  Strecker's  approximation 

The  essence  of  Strecker's  approximation  [21]  is  that  a processor 
that  is  blocked  at  memory  during  one  cycle  discards  that  request  and  proceeds 
to  generate  requests  as  if  it  had  in  fact  been  served.  The  expression  obtained 
for  the  bandwidth  only  takes  into  account  the  combinatorial  interference  and 
ignores  the  queueing  effect.  The  equivalent  analysis  for  the  SFMM  assumes 
that  each  processor  generates  requests  without  interference  from  the  other 
processors. The probability of any given module being idle is the probability
that  no  processor  is  currently  referencing  it.  Since  the  generation  of 
references  by  each  processor  is,  by  assumption,  independent  of  every  other 
processor's  activity,  the  probability  that  no  processor  is  referencing  a 
module is simply the product of the individual probabilities for the N
processors of not referencing the given module. In view of the symmetric nature
of the model of program behavior, this individual probability is (1 - 1/M).
Hence the probability that no processor is currently referencing a given
module is $(1 - 1/M)^N$. The bandwidth is accordingly given by

$$B(M,N) = M\left[1 - (1 - 1/M)^N\right]$$

which  is  exactly  the  same  as  for  the  RIRM  case.  Strecker's  approximation  does 
not  permit  the  analysis  to  benefit  from  the  increased  sophistication  of  the 
SFMM. 
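
For reference, the expression above is straightforward to evaluate. The following sketch uses a hypothetical function name and simply tabulates the formula for four processors.

```python
def strecker_bandwidth(num_modules, num_processors):
    """Strecker's approximation B(M, N) = M * [1 - (1 - 1/M)**N]."""
    # Probability that none of the N processors references a given module.
    idle = (1.0 - 1.0 / num_modules) ** num_processors
    return num_modules * (1.0 - idle)


for M in (2, 4, 8, 16):
    print(M, round(strecker_bandwidth(M, 4), 3))
```

Note that the function takes only M and N as arguments: none of the SFMM parameters enters the result, which is the deficiency pointed out above.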

3.2.  Exact  Markov  analysis 

An  exact  analysis  similar  to  that  used  by  Bhandarkar  [23]  for  the 
RIRM  is  a possibility.  However,  the  size  of  the  state  space  is  much  larger 
since the aggregation of states that is permissible for the RIRM cannot be
performed in the case of the SFMM. In particular, let
$\langle n_1, \ldots, n_M \rangle$ represent the state of the system where
$n_i$ is the number of processors enqueued or in service at module i. In the
case of the RIRM, any state obtained by permuting the numbers $n_1$ through
$n_M$ is equivalent to the original state and all M! such states may be
aggregated. With the SFMM, only cyclic permutations, such as
$\langle n_k, \ldots, n_M, n_1, \ldots, n_{k-1} \rangle$, produce equivalent
states and, therefore, only groups of M states can be aggregated. The number of
states before aggregation of equivalent states can be shown [32] to be equal to

$$\binom{N+M-1}{M-1}$$

and  after  aggregation  it  is  reduced  approximately  by  the  factor  M.  The 
exact  solution  of  the  Markov  chain  requires  the  enumeration  of  all  the  states 
and  the  computation  of  the  transition  probabilities  along  the  lines  of  [23]. 

This can be prohibitively expensive for large values of M or N or if a large
number of (M,N) pairs are to be studied. It is preferable to develop
analytical techniques which do not require the enumeration of the state space.
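
The growth of the state space can be checked by brute force for small systems. The sketch below enumerates the queue-length vectors and counts the classes that remain after merging cyclic rotations; the helper names are hypothetical, and the enumeration is only practical for small M and N, which is precisely the difficulty noted above.

```python
from math import comb


def states(num_modules, num_processors):
    """Yield every queue-length vector (n_1, ..., n_M) with n_1 + ... + n_M = N."""
    if num_modules == 1:
        yield (num_processors,)
        return
    for first in range(num_processors + 1):
        for rest in states(num_modules - 1, num_processors - first):
            yield (first,) + rest


def cyclic_classes(num_modules, num_processors):
    """Number of state classes left after merging cyclic rotations."""
    canonical = set()
    for s in states(num_modules, num_processors):
        rotations = [s[k:] + s[:k] for k in range(num_modules)]
        canonical.add(min(rotations))  # smallest rotation represents the class
    return len(canonical)


M, N = 4, 4
print(comb(N + M - 1, M - 1), cyclic_classes(M, N))  # 35 10
```

For M = N = 4 the 35 states collapse to 10 classes, a reduction of roughly the factor M; the reduction is not exactly M because states with rotational symmetry, such as (1,1,1,1), belong to smaller groups.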

3.3.  The  exponential  queueing  approximation 

An  approximation  introduced  by  Bhandarkar  and  Fuller  [22]  was  to 
replace  each  memory  module  by  an  exponential  server  with  the  same  mean  service 
time of 1 cycle. The resulting closed queueing system falls within the class
of  queueing  networks  considered  by  Gordon  and  Newell  [33].  Using  their 
results  it  can  be  shown  that,  since  the  transition  matrix  for  the  SFMM  is 
doubly  stochastic  and  since  all  servers  have  the  same  mean  service  time,  all 
states are equally probable. Then by using the arguments in [22] the
bandwidth is found to be

$$B(M,N) = \frac{NM}{N+M-1}$$


This  result,  once  again,  is  unchanged  from  the  RIRM  case  and,  so,  this 
analysis,  too,  must  be  discarded  as  inadequate  when  studying  the  SFMM. 
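
The step from "all states are equally probable" to the closed form can be verified numerically for small systems. The sketch below, with hypothetical helper names, averages the number of busy modules over all states and compares it with NM/(N+M-1) using exact rational arithmetic.

```python
from fractions import Fraction


def states(num_modules, num_processors):
    """Yield every queue-length vector (n_1, ..., n_M) with n_1 + ... + n_M = N."""
    if num_modules == 1:
        yield (num_processors,)
        return
    for first in range(num_processors + 1):
        for rest in states(num_modules - 1, num_processors - first):
            yield (first,) + rest


def equiprobable_bandwidth(num_modules, num_processors):
    """Average number of busy modules when every state is equally likely."""
    all_states = list(states(num_modules, num_processors))
    busy = sum(sum(1 for n in s if n > 0) for s in all_states)
    return Fraction(busy, len(all_states))


def exponential_bandwidth(num_modules, num_processors):
    """Closed form NM / (N + M - 1) of the exponential server approximation."""
    return Fraction(num_processors * num_modules, num_processors + num_modules - 1)


print(equiprobable_bandwidth(4, 4), exponential_bandwidth(4, 4))  # 16/7 16/7
```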

3.4.  The  diffusion  approximation 

The  source  of  error  in  the  previous  analysis  was  that  the  structural 
model was compromised by substituting an exponentially distributed service time
for  a constant  service  time.  However,  this  permitted  an  exact  analysis.  An 
alternative  is  to  approximate  the  analysis  but  not  the  model.  The  diffusion 
approximation  [34,35]  does  just  this  and  is  the  only  general  technique  for 
handling  queueing  networks  that  fall  outside  the  class  of  the  so-called  "local 
balance"  networks  [36,37]. 

Kobayashi  [38]  and  Reiser  and  Kobayashi  [39]  have  developed  the 
diffusion  approximation  for  general  queueing  networks  and  have  studied  its 
accuracy.  Bhandarkar  and  Fuller  have  applied  this  theory  to  the  interleaved 
memory  problem  [22]  and,  surprisingly,  they  obtained  exactly  the  same  result 
as  with  the  exponential  server  model.  This  is  probably  due  to  the  fact  that 
they  neglected  the  boundary  conditions  in  the  analysis,  by  paying  attention 
to  which  one  obtains  a considerably  different  result. 

In [39], the approximate joint queue length distribution for a
closed network of queues is shown to be

$$p(n_1, n_2, \ldots, n_M) = \pi \prod_{i=1}^{M} p_i(n_i)$$

for all feasible states $(n_1, n_2, \ldots, n_M)$ where $n_i$ is the queue
length at server i, $n_1 + n_2 + \cdots + n_M = N$, and $\pi$ is a normalization
constant chosen so that the total probability over all feasible states sums to
unity. In this expression

$$p_i(n) = \begin{cases} 1 - u_i & \text{for } n = 0 \\ u_i (1 - \hat\rho_i)\, \hat\rho_i^{\,n-1} & \text{for } n \geq 1 \end{cases}$$

and $\hat\rho_i$ is given by

$$\hat\rho_i = \exp\left( -\frac{2(\mu_i - \lambda_i)}{\lambda_i C_{a_i} + \mu_i C_{s_i}} \right)$$

where

$\mu_i$ = service rate of server i,

$\lambda_i$ = arrival rate to queue i,

$C_{s_i}$ = squared coefficient of variation of the service time at server i,

$C_{a_i}$ = squared coefficient of variation of the inter-arrival time at queue i,

$u_i$ = an unknown quantity.

Other expressions for calculating $\hat\rho_i$ have been suggested [40,41] and
they generally yield greater accuracy.

The symmetry of the system under study here permits some amount of
simplification since $u_i = u$, $\hat\rho_i = \hat\rho$ and $p_i(n) = p(n)$ for
all i. This fact may be readily established in a more rigorous manner.
Consequently,

$$p(n_1, n_2, \ldots, n_M) = \pi (1-u)^{M-k} \left[u(1-\hat\rho)\right]^{k} \hat\rho^{\,N-k} = \pi (1-u)^{M} \hat\rho^{\,N} \left[ \frac{u(1-\hat\rho)}{(1-u)\hat\rho} \right]^{k} = \pi' \theta^{k}$$

where k is the number of busy servers,

$$\theta = \frac{u(1-\hat\rho)}{(1-u)\hat\rho}$$

and $\pi'$ is a new normalization constant. However, $\theta$ cannot be
evaluated since u is unknown and the literature does not provide any
satisfactory procedure for calculating it. For this reason, the method of
calculating $\hat\rho$ is irrelevant. In Section 5.3, a method for directly
evaluating $\theta$ is developed specifically for the multiprocessor memory
bandwidth problem. (If $\theta$ is arbitrarily chosen to be 1 then every state
is equiprobable and the result obtained in [22] follows.)

Overlooking the fact that $\theta$ is unknown, one may proceed to obtain
results which will be of use in Section 5.3. Clearly, there are $\binom{M}{i}$
ways of selecting i busy modules. For each one of these combinations there exist
$\binom{N-1}{i-1}$ ways of assigning the N processors to the i busy modules if
the processors are considered indistinguishable [32]. (Since the state of a
processor's reference process is given by the module last referenced, all
processors queued on a module are, indeed, indistinguishable.) Consequently,
the number of states corresponding to i busy modules is given by
$\binom{M}{i}\binom{N-1}{i-1}$. The probability assigned to each such state is
$\pi' \theta^i$ and so, by the definition of $\pi'$,

$$\pi' \sum_{i=1} \theta^i \binom{M}{i}\binom{N-1}{i-1} = 1$$

and

$$\pi' = 1 \Big/ \sum_{i=1} \theta^i \binom{M}{i}\binom{N-1}{i-1}$$

(the sums extend over all i for which the binomial coefficients are non-zero).
The bandwidth is given by the average number of busy memory modules and so

$$
\begin{aligned}
B(M,N) &= \frac{\sum_{i=1} i\,\theta^i \binom{M}{i}\binom{N-1}{i-1}}{\sum_{i=1} \theta^i \binom{M}{i}\binom{N-1}{i-1}}
= \frac{\sum_{i=1} \theta^i \binom{M-1}{i-1}\binom{N-1}{i-1}}{\sum_{i=1} \frac{\theta^i}{i} \binom{M-1}{i-1}\binom{N-1}{i-1}} \\
&= \frac{\sum_{i=0} \theta^i \binom{M-1}{i}\binom{N-1}{i}}{\sum_{i=0} \frac{\theta^i}{i+1} \binom{M-1}{i}\binom{N-1}{i}}
\end{aligned}
\qquad (1)
$$
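
Expression (1) is easy to evaluate once a value of $\theta$ is available. The sketch below treats $\theta$ as a free input, since its determination is deferred to Section 5.3; the function name is hypothetical.

```python
from math import comb


def diffusion_bandwidth(num_modules, num_processors, theta):
    """Evaluate expression (1) for a supplied value of theta."""
    M, N = num_modules, num_processors
    # theta**i * C(M-1, i) * C(N-1, i); the terms vanish for i >= min(M, N).
    terms = [theta ** i * comb(M - 1, i) * comb(N - 1, i) for i in range(min(M, N))]
    numerator = sum(terms)
    denominator = sum(t / (i + 1) for i, t in enumerate(terms))
    return numerator / denominator


print(diffusion_bandwidth(4, 4, 1.0), 4 * 4 / (4 + 4 - 1))
```

Setting theta to 1 reproduces NM/(N+M-1), as the text notes. As theta tends to 0 the bandwidth tends to 1 (all processors queued on a single module), and as theta grows it approaches min(M, N).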


The probability of a given module being idle may be calculated from
the expression for the bandwidth or by noting that there are $\binom{M-1}{i}$
ways of selecting i busy modules from the remaining (M-1) modules and that for
each of these there are $\binom{N-1}{i-1}$ ways of assigning the N processors
to the i busy modules. Therefore,

$$P[\text{a given module is idle}] = \pi' \sum_{i=1} \theta^i \binom{M-1}{i}\binom{N-1}{i-1} = \frac{\sum_{i=0} \theta^i \binom{M-1}{i+1}\binom{N-1}{i}}{M \sum_{i=0} \frac{\theta^i}{i+1} \binom{M-1}{i}\binom{N-1}{i}}$$
Similarly,

$$P[\text{two given modules are idle}] = \pi' \sum_{i=0} \theta^{i+1} \binom{M-2}{i+1}\binom{N-1}{i}$$


Using  arguments  of  the  same  nature  one  finds  that 


(3) 
$$
P[\text{queue length at a given module} = k \text{ and another given module is idle}] =
\begin{cases}
\pi' \sum_{i=1} \theta^{i} \binom{M-2}{i}\binom{N-1}{i-1} & k = 0 \\
\pi' \sum_{i=1} \theta^{i+1} \binom{M-2}{i}\binom{N-k-1}{i-1} & N > k > 0 \\
\pi' \theta & k = N
\end{cases}
$$

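Hence, the average queue length at a given module given that some other module
is idle

$$
\begin{aligned}
&= \frac{\pi' \left( \sum_{k=1}^{N-1} k \sum_{i=1} \theta^{i+1} \binom{M-2}{i}\binom{N-k-1}{i-1} + N\theta \right)}
{\pi' \left( \sum_{k=1}^{N-1} \sum_{i=1} \theta^{i+1} \binom{M-2}{i}\binom{N-k-1}{i-1} + \theta + \sum_{i=1} \theta^{i} \binom{M-2}{i}\binom{N-1}{i-1} \right)} \\
&= \frac{\sum_{i=1} \theta^{i+1} \binom{M-2}{i} \sum_{k=1}^{N-1} k \binom{N-k-1}{i-1} + N\theta}
{\sum_{i=1} \theta^{i+1} \binom{M-2}{i} \sum_{k=1}^{N-1} \binom{N-k-1}{i-1} + \theta + \sum_{i=0} \theta^{i+1} \binom{M-2}{i+1}\binom{N-1}{i}} \\
&= \frac{\sum_{i=0} \theta^{i+1} \binom{M-2}{i}\binom{N}{i+1}}
{\sum_{i=0} \theta^{i+1} \binom{M-2}{i}\binom{N-1}{i} + \sum_{i=0} \theta^{i+1} \binom{M-2}{i+1}\binom{N-1}{i}} \\
&= \frac{\sum_{i=0} \theta^{i+1} \binom{M-2}{i}\binom{N}{i+1}}
{\sum_{i=0} \theta^{i+1} \binom{M-1}{i+1}\binom{N-1}{i}}
= \frac{N \sum_{i=0} \theta^{i+1} \binom{M-2}{i}\binom{N-1}{i}/(i+1)}
{(M-1) \sum_{i=0} \theta^{i+1} \binom{M-2}{i}\binom{N-1}{i}/(i+1)} \\
&= N/(M-1)
\end{aligned}
$$

which is a result that might have been expected. These last three expressions
are used to evaluate $\theta$ in Section 5.3 whereupon the bandwidth is readily
obtained.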
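
The result N/(M-1) can be checked directly for small systems by enumerating the states and weighting each one by $\theta$ raised to its number of busy modules. The sketch below does so with exact rational arithmetic; the helper names are hypothetical, module 0 plays the role of the given module and module 1 that of the module known to be idle.

```python
from fractions import Fraction


def states(num_modules, num_processors):
    """Yield every queue-length vector (n_1, ..., n_M) with n_1 + ... + n_M = N."""
    if num_modules == 1:
        yield (num_processors,)
        return
    for first in range(num_processors + 1):
        for rest in states(num_modules - 1, num_processors - first):
            yield (first,) + rest


def conditional_mean_queue(num_modules, num_processors, theta):
    """Mean queue length at module 0 given that module 1 is idle (requires M >= 2)."""
    numerator = denominator = Fraction(0)
    for s in states(num_modules, num_processors):
        if s[1] != 0:
            continue  # keep only states in which module 1 is idle
        weight = Fraction(theta) ** sum(1 for n in s if n > 0)
        numerator += s[0] * weight
        denominator += weight
    return numerator / denominator


print(conditional_mean_queue(3, 4, Fraction(1, 3)))  # 2
```

The value does not depend on the weight chosen for theta, consistent with the derivation: conditioning on one idle module leaves the N processors spread symmetrically over the other M-1 modules.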

3.5.  An  approximate  asymptotic  analysis 

The method used here is basically the one used by Baskett and Smith [24] under the name of the "binomial approximation" and is asymptotically exact as $M, N \to \infty$. However, it is extended to handle the SFMM for the referencing behavior, and more attention is paid to details which, though unimportant for the RIRM, lead to additional inaccuracy with the model assumed here.

Assume  that  the  modules  are  labelled  0 through  M-l.  The  analysis 
concentrates  upon  the  queue  on  module  0 which  is,  for  the  most  part,  treated 
like  an  isolated  queue.  The  remaining  modules  are  relevant  insofar  as  they 
affect  the  arrival  process  to  module  0.  Two  assumptions  are  made  in  the 
analysis  : 

Assumption  1.  Given  that  n of  the  modules  1 through  M-l  are  busy  (when  the 
system  is  stationary),  any  combination  of  n modules  is  equally  likely  to 
be  the  set  of  busy  modules. 

Assumption 2. The distribution of the number of busy modules in the stationary system is independent of the queue length at module 0. (This assumption, which is obviously untrue for small values of M and N, becomes increasingly acceptable as M and N become very large.)

Based on these assumptions, the analysis proceeds through the use of probability generating functions (p.g.f.) [42]. Let $h_i$ be the stationary probability that queue 0 is of length $i$. (The queue always includes the request being served.) Then the p.g.f. $H(z)$ is defined to be

$$H(z) = \sum_{i=0}^{\infty} h_i z^i.$$


Two additional sets of p.g.f.'s are $A_j(z)$ and $B_j(z)$, which correspond to the number of arrivals to queue 0 at the start of a cycle given that $j$ modules in the entire system are busy, under two conditions respectively: that module 0 is one of the $j$ busy modules, or that module 0 is idle. If $q_j$ is the stationary probability of $j$ busy modules then define

$$A(z) = \sum_{j=1}^{\min(M,N)} q_j A_j(z)$$

and

$$B(z) = \sum_{j=1}^{\min(M,N)} q_j B_j(z).$$

By Assumption 2, $q_j$ is independent of the length of queue 0.

If the queue length at module 0 is non-zero, then at the end of a cycle it will be decremented by one with probability $(1-p_0)$ and then incremented by a random variable which is the number of arrivals. If the queue length were 0, it would merely be incremented by the random number of arrivals. Bearing in mind that the addition (subtraction) of an independent random variable corresponds, in the transform domain, to multiplication (division) by the p.g.f. of the random variable, we have that the p.g.f. for the queue length during the next cycle is

$$H^*(z) = p_0 [H(z) - H(0)] A(z) + (1-p_0)\left[\frac{H(z) - H(0)}{z}\right] A(z) + H(0)B(z)$$

$$= [H(z) - H(0)]\left(p_0 + \frac{1-p_0}{z}\right) A(z) + H(0)B(z)$$

$$= H(z) \quad \text{by stationarity.}$$

Note that $H(0) = h_0$ = the probability that queue 0 is empty. The above equation may be rewritten by collecting all terms involving $H(z)$ on the left hand side and then dividing on both sides by the coefficient of $H(z)$ to obtain

$$H(z) = H(0)\,\frac{zB(z) - (1 - p_0 + p_0 z)A(z)}{z - (1 - p_0 + p_0 z)A(z)}.$$


The strategy is to solve for $H(0)$, the idle fraction, by evaluating the equation (or some derivative of it) at the point $z = 0$ or 1 without obtaining a trivial identity. It turns out that setting $z = 0$ or 1 in the above equation yields obvious identities. However, if we differentiate both sides with respect to $z$ and set $z$ equal to 1, we obtain $H'(1)$ on the left hand side, which is the mean length of queue 0 and this, by symmetry, must be equal to $N/M$. The evaluation of the right hand side requires two successive applications of L'Hospital's rule and, after some simple, if tedious, computation, yields

$$H'(1) = \frac{N}{M} = H(0)\,\frac{[1-p_0-A'(1)][2B'(1)+B''(1)] + B'(1)[2p_0A'(1)+A''(1)]}{2[1-p_0-A'(1)]^2}. \qquad (5)$$


The  idle  fraction  is  obtained  if  A'(l),  A"(l),  B'(l)  and  B"(l)  are  known. 

Let $\omega_n$ be an ordered set of all combinations obtained by selecting $n$ modules out of the modules 1 through $M-1$. Let $c_k$ be the k-th such combination in $\omega_n$ and let $\{k_1,\ldots,k_n\}$ be the indices of the $n$ selected modules. Then if $i$ is the number of busy modules in the system we have

$$A_i(z) = \frac{1}{\binom{M-1}{i-1}} \sum_{c_k \in \omega_{i-1}} \prod_{j=1}^{i-1} \left(1 - p_{-k_j} + p_{-k_j} z\right)$$

and

$$B_i(z) = \frac{1}{\binom{M-1}{i}} \sum_{c_k \in \omega_{i}} \prod_{j=1}^{i} \left(1 - p_{-k_j} + p_{-k_j} z\right)$$

where $p_{-k_j}$ is the probability of a processor referencing module 0 after referencing module $k_j$. (The negative index denotes the positive index obtained by performing the negation modulo M.) These expressions depend on the assumption that any combination of modules is equally likely to be the set of busy modules. In writing down $A_i(z)$, the fact that module 0 is busy has been accounted for by selecting combinations of $(i-1)$ modules from the remaining $(M-1)$ modules. This assumes that the total number of busy modules is independent of the queue length at module 0. Now,


$$A_i'(1) = \frac{1}{\binom{M-1}{i-1}} \sum_{c_k \in \omega_{i-1}} \sum_{j=1}^{i-1} p_{-k_j}$$

and

$$A_i''(1) = \frac{1}{\binom{M-1}{i-1}} \sum_{c_k \in \omega_{i-1}} \sum_{r=1}^{i-1} \sum_{\substack{s=1 \\ s \neq r}}^{i-1} p_{-k_r}\, p_{-k_s}.$$

To evaluate $A_i'(1)$ further, note that of the $\binom{M-1}{i-1}$ possible combinations, module $t$ appears in exactly $\binom{M-2}{i-2}$ combinations and consequently $p_{-t}$ will appear in the summation $\binom{M-2}{i-2}$ times. Therefore,

$$A_i'(1) = \frac{\binom{M-2}{i-2}}{\binom{M-1}{i-1}} \sum_{j=1}^{M-1} p_j = \frac{(i-1)}{(M-1)} \sum_{j=1}^{M-1} p_j = \frac{(i-1)(1-p_0)}{(M-1)}$$

and by similar arguments

$$A_i''(1) = \frac{(i-1)(i-2)}{(M-1)(M-2)}\left[1 - 2p_0 + 2p_0^2 - \sum_{j=0}^{M-1} p_j^2\right] = \frac{2(i-1)(i-2)(1-p_0)(1/\gamma - p_0)}{(M-1)(M-2)}$$

where $\gamma = 2(1-p_0)\Big/\left(1 - \sum_{j=0}^{M-1} p_j^2\right)$.


Therefore,

$$A'(1) = \sum_{i=1}^{\min(M,N)} q_i A_i'(1) = \frac{(1-p_0)}{(M-1)} \sum_{i=1}^{\min(M,N)} (i-1)\,q_i = \frac{(1-p_0)}{(M-1)}(\mu - 1)$$

and

$$A''(1) = \frac{2(1-p_0)(1/\gamma - p_0)}{(M-1)(M-2)} \sum_{i=1}^{\min(M,N)} (i-1)(i-2)\,q_i = \frac{2(1-p_0)(1/\gamma - p_0)}{(M-1)(M-2)}\left[v + (\mu-1)(\mu-2)\right]$$

where $\mu$ is the mean number of busy modules in the system and $v$ is the variance of this number. Similarly,

$$B'(1) = \frac{(1-p_0)}{(M-1)}\,\mu$$

and

$$B''(1) = \frac{2(1-p_0)(1/\gamma - p_0)}{(M-1)(M-2)}\left[v + \mu(\mu-1)\right].$$


Now, as in [24], one may argue that as M and N both become extremely large, the various queues become increasingly independent of one another. Consequently, the central limit theorem applies to the distribution of the number of busy modules. In particular, the quantity $v/\mu^2 \to 0$ and, hence,

$$A''(1) \to \frac{2(1-p_0)(1/\gamma - p_0)}{(M-1)(M-2)}(\mu-1)(\mu-2)$$

and

$$B''(1) \to \frac{2(1-p_0)(1/\gamma - p_0)}{(M-1)(M-2)}\,\mu(\mu-1).$$

Equation 5 may now be used to solve for the value of $\mu$ by substituting the (asymptotic) values of $A'(1)$, $A''(1)$, $B'(1)$ and $B''(1)$ and noting that

$$H(0) = 1 - \mu/M$$

to obtain

$$B(M,N,\gamma) = \mu = \frac{\gamma(M+N) - 1 \pm \sqrt{[\gamma(M+N) - 1]^2 - 4\gamma(\gamma-1)MN}}{2(\gamma-1)}, \qquad (6)$$

the appropriate sign being chosen to ensure that $\mu$ is not greater than either M or N.

A few special cases are of interest. For the RIRM, $\gamma$ is equal to 2 and the expression reduces to that obtained by Baskett and Smith via the "binomial approximation". As $p_0 \to 1$, $\gamma \to 1$ and the queueing system tends to an exponential queueing network. This is clear if one views a sequence of successive visits of unit service time to a module as constituting one visit with a geometrically distributed service time. As $p_0 \to 1$, the mean service time becomes exceedingly greater than one cycle and, if the distribution function is scaled so that the mean is one, then the stepwise increasing function begins to appear continuous and exponential. For a value of $\gamma = 1$,

$$B(M,N,1) = \frac{MN}{M+N-1}$$

which is the formula derived in Section 3.3 for the exponential model. As $p_0 \to 0$ and $p_l \to 1$ for some $l \neq 0$, $\gamma \to \infty$ and

$$B(M,N,\infty) = \min(M,N).$$

This is the expected value for the bandwidth since all the processors get synchronized. The range of variation of $\gamma$ is between 1 and $\infty$ and $\mu$ is an increasing function of $\gamma$. At the two extreme values of $\gamma$, the value of $\mu$ is exact. However, at intermediate points such as $\gamma = 2$, it is fairly accurate but not exact. Also, since the assumptions made are correct for $M, N \to \infty$, the expression is asymptotically correct.
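
The asymptotic expression and its limiting cases can be checked numerically. The following Python sketch is illustrative only: the function name `asymptotic_bandwidth` is an assumption, the smaller root of the quadratic is taken (it is the root that never exceeds M or N), and the case gamma = 1 is treated separately because the quadratic degenerates to a linear equation there.

```python
from math import sqrt


def asymptotic_bandwidth(m, n, gamma):
    """Asymptotic bandwidth for m modules, n processors and gamma >= 1."""
    if gamma == 1:
        # Degenerate case: the quadratic collapses to MN / (M + N - 1).
        return m * n / (m + n - 1)
    s = gamma * (m + n) - 1
    disc = s * s - 4 * gamma * (gamma - 1) * m * n
    # Minus sign: the smaller root is the one not exceeding min(m, n).
    return (s - sqrt(disc)) / (2 * (gamma - 1))


if __name__ == "__main__":
    # gamma = 1 (exponential limit), gamma = 2 (RIRM), large gamma (synchronized).
    for g in (1, 2, 1e6):
        print(g, asymptotic_bandwidth(8, 4, g))
```

For a very large value of gamma the result approaches min(M, N), in line with the synchronized limit described above.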


4.  An  exact  analysis 

The  analysis  in  this  section  makes  use  of  the  probability  generating 
function  approach  and  bears  some  resemblance  to  the  analysis  of  Section  3.5. 
However, it makes no approximations. The state of the system during any cycle is given by the vector $(n_0,\ldots,n_{M-1})$ which lists the number of requests queued at each module. The set of all feasible states constitutes a Markov chain since the probability distribution of the next state depends only on the current state. Furthermore, this Markov chain is aperiodic since it is possible to make a one-step transition from any state back to the same state, and it is irreducible since it is possible to reach any state from any other state in a finite number of transitions. Hence, the Markov chain is ergodic and possesses a unique stationary probability distribution.

Let $\mathcal{S}$ be the class of all feasible states (for which $n_0 + \cdots + n_{M-1} = N$). Let $P(S_j)$ be the stationary probability of being in state $S_j$ and let the corresponding vector be $(n_{0j},\ldots,n_{M-1,j})$. Then, define the probability generating function

$$H(Z) = H(z_0,\ldots,z_{M-1}) = \sum_{S_j \in \mathcal{S}} P(S_j) \prod_{i=0}^{M-1} z_i^{\,n_{ij}}.$$

Let $Q_i(Z)$ be the p.g.f. for the routing probabilities of a processor whose last request was to module $i$, i.e.,

$$Q_i(Z) = \sum_{j=0}^{M-1} p_{j-i}\, z_j$$

where the subtraction in the subscript is performed modulo M.

At the end of a cycle, each module which was not idle releases a serviced processor which is then reassigned to some module based on the routing probabilities. Define $R_i$ to be the operator which operates on $H(Z)$ to yield the p.g.f. of the system after module $i$ has released a processor (if not idle). Just as $z_i$ is the dummy variable associated with processors enqueued at module $i$, let $y_i$ be the dummy variable associated with a processor that has just been released by module $i$. Lastly, if $U_i$ is defined to be the operator which sets $z_i = 0$ in $H(Z)$ then, just as in Section 3.5,

$$R_i H(Z) = U_i H(Z) + [H(Z) - U_i H(Z)]\,\frac{y_i}{z_i} = \frac{y_i}{z_i} H(Z) + \left(1 - \frac{y_i}{z_i}\right) U_i H(Z).$$

From the above expression one may define

$$R_i = \frac{y_i}{z_i} + \left(1 - \frac{y_i}{z_i}\right) U_i.$$

Note that $U_i$ and $U_j$ $(i \neq j)$ are commutative operations since they operate on distinct dummy variables. Therefore, at the end of a cycle, after each module has released a processor (if any), the p.g.f. for the system is

$$\left\{\prod_{i=0}^{M-1} R_i\right\} H(Z) = \left\{\prod_{i=0}^{M-1}\left[\frac{y_i}{z_i} + \left(1 - \frac{y_i}{z_i}\right) U_i\right]\right\} H(Z).$$

In the transform domain, the processors are reassigned by substituting $Q_i(Z)$ for $y_i$ in the above expression to obtain the p.g.f. for the system during the next cycle. By stationarity this must be equal to $H(Z)$ and, therefore,

$$H(Z) = \left\{\prod_{i=0}^{M-1}\left[Q_i(Z)/z_i + (1 - Q_i(Z)/z_i)\,U_i\right]\right\} H(Z). \qquad (7)$$

This equation is exactly equivalent to equating the stationary probability vector for the underlying Markov chain to the product of the same vector and the transition matrix. However, it does not require the enumeration of all the states and is consequently more convenient. It also permits the analysis to proceed in a much smoother fashion.
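
For very small systems the state enumeration that the generating function approach avoids is still feasible, and it gives a numerical reference point. The Python sketch below is illustrative only and is not the method of the paper: the names `compositions` and `exact_bandwidth` are assumptions, the stationary distribution is approximated by repeated application of the one-cycle transition (power iteration), and a processor leaving module `src` is assumed to go to module `dst` with probability `p[(dst - src) % M]`.

```python
from itertools import product


def compositions(n, m):
    """Yield every tuple of m non-negative integers that sums to n."""
    if m == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, m - 1):
            yield (first,) + rest


def exact_bandwidth(p, n_proc, sweeps=500):
    """Mean number of busy modules for SFMM parameters p and n_proc processors."""
    m = len(p)
    states = list(compositions(n_proc, m))
    dist = {s: 1.0 / len(states) for s in states}  # arbitrary starting vector
    for _ in range(sweeps):
        nxt = dict.fromkeys(states, 0.0)
        for s, prob in dist.items():
            busy = [i for i in range(m) if s[i] > 0]
            # Every busy module releases exactly one processor per cycle.
            base = [q - 1 if q > 0 else 0 for q in s]
            for dests in product(range(m), repeat=len(busy)):
                w = prob
                t = list(base)
                for src, dst in zip(busy, dests):
                    w *= p[(dst - src) % m]
                    t[dst] += 1
                nxt[tuple(t)] += w
        dist = nxt
    return sum(prob * sum(1 for q in s if q > 0) for s, prob in dist.items())


if __name__ == "__main__":
    print(exact_bandwidth([0.5, 0.5], 2))  # RIRM with M = N = 2
```

The cost grows rapidly with M and N, so this is only practical as a check on the analytical expressions for a handful of modules and processors.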

Let $D_i$ be the operator that differentiates with respect to $z_i$. Then, bearing in mind that

$$D_i(U_j H(Z)) = \begin{cases} U_j(D_i H(Z)), & i \neq j \\ 0, & i = j \end{cases}$$

one observes that

$$D_i\{[Q_j(Z)/z_j + (1 - Q_j(Z)/z_j)U_j]H(Z)\} = \{[D_i(Q_j(Z)/z_j)] + (Q_j(Z)/z_j)D_i + [D_i(1 - Q_j(Z)/z_j)]U_j + (1 - Q_j(Z)/z_j)D_iU_j\}H(Z).$$

By the symmetry of the problem, the marginal queue length distributions for all modules are identical and it is necessary to study only one, the one for module 0. The p.g.f. of this marginal distribution is obtained by setting $z_1 = \cdots = z_{M-1} = 1$. For notational convenience $z_0$ is set equal to $z$ and the following definitions are made: given that $Z^* = (z, 1, \ldots, 1)$, define

$$h(z) = H(Z^*)$$

$$V_i(z) = Q_i(Z^*)$$

$$U_0 h(z) = h(0)$$

$$U_i h(z) = U_i H(Z^{(i)}), \quad i \neq 0$$

where $Z^{(i)}$ corresponds to $z_0 = z$, $z_i = 0$ and $z_j = 1$ $(j \neq 0, i)$.

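Equation 7 now assumes the form

$$h(z) = \left\{\left[\frac{V_0(z)}{z} + \left(1 - \frac{V_0(z)}{z}\right)U_0\right] \prod_{i=1}^{M-1}\left[V_i(z) + (1 - V_i(z))U_i\right]\right\} h(z).$$

The right hand side includes a term in $h(z)$ with the coefficient $(1/z)\prod_{i=0}^{M-1} V_i(z)$. Collecting terms in $h(z)$ on the left hand side and then dividing on both sides by the coefficient of $h(z)$ one obtains

$$h(z) = \frac{\left\{\left[V_0(z) + (z - V_0(z))U_0\right]\prod_{i=1}^{M-1}\left[V_i(z) + (1 - V_i(z))U_i\right] - \prod_{i=0}^{M-1} V_i(z)\right\} h(z)}{z - \prod_{i=0}^{M-1} V_i(z)}.$$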


The most convenient and non-trivial identity is obtained by putting $z = 1$. This entails applying L'Hospital's rule twice on the right hand side, and the rules regarding the commutativity of $D_0$ and $U_i$ (or $V_i$) must be borne in mind. The result of setting $z = 1$ is to get

$$h(1) = 1 = \frac{2\sum_{i=1}^{M-1} V_i'(1)\,U_i h'(1) - \sum_{i=0}^{M-1}\sum_{\substack{j=0 \\ j \neq i}}^{M-1} V_i'(1)V_j'(1)[1 - U_i - U_j + U_iU_j]h(1) - 2\sum_{i=1}^{M-1} V_i'(1)[U_i - U_0U_i]h(1) + 1 - \sum_{i=0}^{M-1}[V_i'(1)]^2}{1 - \sum_{i=0}^{M-1}[V_i'(1)]^2} \qquad (8)$$

where

$V_i'(1) = p_{-i}$,

$U_j h(1) = P[\text{module } j \text{ is idle}]$,

$U_i h(1) = U_j h(1)$ for all $i, j$ by symmetry,

$U_iU_j h(1) = P[\text{module } i \text{ and module } j \text{ are idle}] = P[\text{module } i \text{ is idle} \mid \text{module } j \text{ is idle}] \cdot P[\text{module } j \text{ is idle}]$,

and $U_i h'(1) = [\text{Mean queue length at module 0} \mid \text{module } i \text{ is idle}] \cdot P[\text{module } i \text{ is idle}]$.

Our interest is in solving for $U_0 h(1)$ but, due to the presence of terms of the form $U_iU_j h(1)$ and $U_i h'(1)$, we cannot do so immediately.

5.  "Almost-exact"  solutions 

The analysis of the preceding section can continue in certain special cases by making assumptions which are intuitively plausible:

5.1.  The  RIRM  case 


Define $F(M,N)$ to be the marginal probability of a module being idle in a system with M modules and N processors. Therefore, $F(M,N) = U_i h(1)$ for all $i$. The approximation introduced in [27] is to assume that the marginal queue length distribution at a module in a system with M modules, given that some other module is idle, is identical to the marginal queue length distribution in a system with only M-1 modules. The use of a slightly stronger approximation has been suggested in [24]. As a consequence of this assumption,

$$U_iU_j h(1) = F(M-1,N)\,F(M,N)$$

and

$$U_i h'(1) = \frac{N}{M-1}\,F(M,N)$$

and Equation 8 yields

$$F(M,N) = \frac{1}{\dfrac{2N}{M-1} + F(M-1,N)}.$$

This recurrence may be solved by noting that $F(1,N) = 0$ for all $N \geq 1$. In [27] it is shown that the recurrence is satisfied by the following solution:


$$F(M,N) = 1 - \frac{B(M,N)}{M}$$

where the memory bandwidth $B(M,N)$ is given by

$$B(M,N) = \frac{\sum_{i=0}^{\infty} 2^i \binom{M-1}{i}\binom{N-1}{i}}{\sum_{i=0}^{\infty} \dfrac{2^i}{(i+1)} \binom{M-1}{i}\binom{N-1}{i}}.$$

Note that the above summations are finite since the terms become 0 for $i \geq M$ or $N$. This expression has been compared to the exact results listed in [22]. The maximum error was found to occur for values of M = N. Table 5 compares the error of Strecker's approximation, the Binomial approximation and the above expression for values of M = N. The error of this "almost-exact" solution is always less than 0.25%.
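
The recurrence for the idle probability is easy to iterate. The Python sketch below is illustrative only; it assumes that the recurrence reads F(M,N) = 1/[2N/(M-1) + F(M-1,N)], that F(1,N) = 0 (a single module is never idle when at least one processor is present), and that the bandwidth is M[1 - F(M,N)]. The function names are hypothetical. Exact rational arithmetic is used so that the values can be compared with the closed-form sums without rounding.

```python
from fractions import Fraction


def rirm_idle_probability(m, n):
    """F(M, N) for the RIRM case, built up from F(1, N) = 0."""
    f = Fraction(0)
    for k in range(2, m + 1):
        # One step of the recurrence: go from k - 1 modules to k modules.
        f = 1 / (Fraction(2 * n, k - 1) + f)
    return f


def rirm_bandwidth(m, n):
    """Bandwidth B(M, N) = M * (1 - F(M, N))."""
    return m * (1 - rirm_idle_probability(m, n))


if __name__ == "__main__":
    for size in range(1, 9):
        print(size, float(rirm_bandwidth(size, size)))
```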

5.2.  The  Localized  Reference  Model 

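The assumption made for the RIRM analysis seems equally justified when analyzing the localized model of Sethi and Deo. Let $p_0 = \alpha$. Then from Equation 8 one obtains [27]

$$F(M,N,\alpha) = \frac{1}{\dfrac{\gamma N}{M-1} + 2 - \gamma + (\gamma - 1)F(M-1,N)}$$

and

$$B(M,N,\alpha) = \frac{\sum_{i=0}^{\infty} \gamma^i \binom{M-1}{i}\binom{N-1}{i}}{\sum_{i=0}^{\infty} \dfrac{\gamma^i}{(i+1)} \binom{M-1}{i}\binom{N-1}{i}}$$

where

$$\gamma = \frac{2(M-1)}{(1+\alpha)M - 2}.$$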

This result cannot be directly tested for accuracy since no exact analysis has been performed. However, Sethi and Deo [26] have performed simulation runs each of length 5000 cycles for values of N and M ranging from 2 through 8 and 2 through 10 respectively. For each value of (M,N) they have tabulated the bandwidth averaged over values of $\alpha$ = 0.75, 0.8, 0.85, 0.9 and 0.95. Table 6 displays these values for M = N and compares them with corresponding values obtained via the above expression. The percentage difference, too, is tabulated. This difference is the combined effect of the error in the analysis as well as the statistical error in the simulations. The difference is less than 1% over the range of M and N examined.

5.3. The Symmetric First-order Markov Model

The assumption made for the RIRM and the LRM in Sections 5.1 and 5.2 respectively cannot be accepted quite as readily in the case of the SFMM. The system with one less module will have a different transition matrix which cannot be easily related to that of the original system. However, if one is willing to accept the diffusion approximation then the exact analysis of Section 4 can be continued to provide an approximate result.

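Using the terminology of Section 4 and the results of Section 3.4 (Equations 2-4) the following equations are obtained:

$$U_j h(1) = \frac{\sum_{i=0}^{\infty} \theta^i \binom{M-1}{i+1}\binom{N-1}{i}}{\sum_{i=0}^{\infty} \theta^i \binom{M}{i+1}\binom{N-1}{i}} \equiv \frac{A}{D} \quad \text{for all } j$$

$$U_iU_j h(1) = \frac{\sum_{i=0}^{\infty} \theta^i \binom{M-2}{i+1}\binom{N-1}{i}}{\sum_{i=0}^{\infty} \theta^i \binom{M}{i+1}\binom{N-1}{i}} \equiv \frac{B}{D}$$

$$U_i h'(1) = \frac{N}{M-1} \cdot \frac{A}{D} \quad \text{for all } i$$

$$\sum_{i=0}^{M-1} V_i'(1) = 1$$

$$\sum_{i=1}^{M-1} V_i'(1) = 1 - p_0$$

$$\sum_{i=0}^{M-1} [V_i'(1)]^2 = \sum_{i=0}^{M-1} p_i^2$$

$$\sum_{i=0}^{M-1} \sum_{\substack{j=0 \\ j \neq i}}^{M-1} V_i'(1)V_j'(1) = 1 - \sum_{i=0}^{M-1} p_i^2$$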

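Then, from Equation 8,

$$1 - \sum_{i=0}^{M-1} p_i^2 = 2(1-p_0)\frac{N}{M-1}\frac{A}{D} + 2\left(1 - \sum_{i=0}^{M-1} p_i^2 - 1 + p_0\right)\frac{A}{D} + \left[2(1-p_0) - \left(1 - \sum_{i=0}^{M-1} p_i^2\right)\right]\frac{B}{D}.$$

Substituting $\gamma$ for $2(1-p_0)/(1 - \sum_{i=0}^{M-1} p_i^2)$ one gets

$$\left(\frac{\gamma N}{M-1} + 2 - \gamma\right)A + (\gamma - 1)B = D.$$

Now,

$$\frac{\gamma N}{M-1}A - \gamma A + \gamma B = \gamma \sum_{i=0}^{\infty} \theta^i \left[\frac{N}{M-1}\binom{M-1}{i+1}\binom{N-1}{i} - \binom{M-1}{i+1}\binom{N-1}{i} + \binom{M-2}{i+1}\binom{N-1}{i}\right]$$

$$= \gamma \sum_{i=0}^{\infty} \theta^i \left[\binom{M-2}{i}\binom{N}{i+1} + \binom{M-2}{i+1}\binom{N-1}{i} - \binom{M-1}{i+1}\binom{N-1}{i}\right]$$

$$= \gamma \sum_{i=0}^{\infty} \theta^i \binom{M-2}{i}\binom{N-1}{i+1}$$

and

$$D - 2A + B = \sum_{i=0}^{\infty} \theta^i \left[\binom{M}{i+1}\binom{N-1}{i} - 2\binom{M-1}{i+1}\binom{N-1}{i} + \binom{M-2}{i+1}\binom{N-1}{i}\right]$$

$$= \sum_{i=0}^{\infty} \theta^i \left[\binom{M-1}{i}\binom{N-1}{i} - \binom{M-2}{i}\binom{N-1}{i}\right]$$

$$= \sum_{i=0}^{\infty} \theta^i \binom{M-2}{i-1}\binom{N-1}{i}$$

$$= \sum_{i=1}^{\infty} \theta^i \binom{M-2}{i-1}\binom{N-1}{i} \quad \text{since } \binom{M-2}{-1} = 0$$

$$= \sum_{i=0}^{\infty} \theta^{i+1} \binom{M-2}{i}\binom{N-1}{i+1}.$$

Therefore $\theta = \gamma$ and the bandwidth from Equation 1 is

$$B(M,N,\gamma) = \frac{\sum_{i=0}^{\infty} \gamma^i \binom{M-1}{i}\binom{N-1}{i}}{\sum_{i=0}^{\infty} \dfrac{\gamma^i}{(i+1)} \binom{M-1}{i}\binom{N-1}{i}}.$$

This expression is termed the Diffusion/Generating Function (DGF) solution.

It is easily ascertained that $B(M,N,\gamma)$ satisfies the recurrence

$$B(M,N,\gamma) = \frac{M[\gamma N - (\gamma - 1)B(M-1,N,\gamma)]}{\gamma N + M - 1 - (\gamma - 1)B(M-1,N,\gamma)}.$$

This recurrence may be used to derive the asymptotic result of Section 3.5. For large values of M it might be expected that the addition of another module would not have an appreciable effect upon the bandwidth, i.e., $B(M-1,N,\gamma) \approx B(M,N,\gamma)$. The use of this approximate equality converts the recurrence relation into a quadratic equation in $B(M,N,\gamma)$, the solution of which yields Equation 6.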
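
The DGF solution and its recurrence can be evaluated directly. The Python sketch below is illustrative only: the function names are hypothetical, the closed form is taken to be the ratio of the two finite sums over i with terms gamma^i C(M-1,i) C(N-1,i) and the same terms divided by (i+1), and the recurrence is started from B(1,N,gamma) = 1 on the assumption that a single module is always busy.

```python
from math import comb


def dgf_bandwidth(m, n, gamma):
    """Closed-form DGF bandwidth for m modules and n processors."""
    # Terms vanish for i >= min(m, n), so the sums are finite.
    terms = [gamma ** i * comb(m - 1, i) * comb(n - 1, i)
             for i in range(min(m, n))]
    top = sum(terms)
    bottom = sum(t / (i + 1) for i, t in enumerate(terms))
    return top / bottom


def dgf_bandwidth_recurrence(m, n, gamma):
    """The same quantity obtained by adding one module at a time."""
    b = 1.0  # B(1, N, gamma): assumed starting value
    for k in range(2, m + 1):
        shared = (gamma - 1) * b
        b = k * (gamma * n - shared) / (gamma * n + k - 1 - shared)
    return b


if __name__ == "__main__":
    for m in (2, 4, 8, 16):
        print(m, dgf_bandwidth(m, 4, 2.0), dgf_bandwidth_recurrence(m, 4, 2.0))
```

With gamma = 1 the closed form reduces to MN/(M+N-1), and with gamma = 2 it gives the RIRM expression of Section 5.1.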

The expression for $B(M,N,\gamma)$ is symmetric in M and N when $\gamma$ is fixed. However, since $\gamma$ may be expected to vary with M, the bandwidth observed in a system in which all processors have identical reference behavior will not be symmetric in M and N.

The expression for $B(M,N,\gamma)$ is exact if $\gamma = 1$ or if $\gamma = \infty$. For $\gamma = 2$, the result of Section 5.1 for the RIRM case is obtained. This was found to be accurate within a quarter of one percent. To assess the error of this expression for an arbitrary value of $\gamma$, four simulations were performed corresponding to those described in Section 2 (N=4 and M=2,4,8,16). The measured bandwidth was compared with the bandwidth computed analytically using the value of $\gamma$ obtained from the estimated transition probabilities (Table 4). The discrepancies may be attributed to four factors:

1)  the  SFMM  is  not  an  adequate  model  of  program  behavior  due  to  the 
presence  of  higher  order  dependencies, 

2)  the  reference  patterns  of  the  four  processors  are  not  statistically 
identical  as  assumed  by  the  model, 

3)  the  statistical  error  in  the  simulation, 

4)  the  approximations  introduced  in  the  analysis. 

Table  7 displays  the  bandwidths  obtained  from  the  simulation,  the 

DGF  solution  and  the  asymptotic  solution.  The  percentage  errors  are  also 

shown  in  parentheses.  The  agreement  is  seen  to  be  quite  satisfactory. 

In an attempt to separate the error caused by the analysis from that due to assumptions in the model, two additional simulations were performed for N=4 and M=2,4. Of the four factors listed above, the first two were eliminated by generating reference strings which conformed exactly with the model, i.e. all four reference strings were statistically identical and were generated using the SFMM parameters of Table 4. Both runs were 2500 cycles long. Table 8 records the measured bandwidth, the analytical prediction and the percentage error. In conjunction with Table 7 one may conclude that a major portion of the error is due to the fact that the SFMM is not quite adequate in describing real reference strings. However, absolutely speaking, the error caused by modelling approximations is acceptably small. The error due to the analysis itself (Table 8) is smaller still. (Note that the recorded error also includes the statistical simulation error.)

6. A Two-parameter model

The Symmetric First-order Markov Model would appear to be an adequate model of program behavior, but it has one drawback in that a separate model is needed for each degree of interleaving. It is clearly desirable that the specification of the model of program behavior be independent of the physical structure of the memory since, firstly, the program exists regardless of the memory and, secondly, greater parsimony and generality are achieved by having just one model which is applicable to any memory structure of interest.

To obtain a memory interleaved 8 ways from one interleaved 16 ways, module i is combined with module i + 8 (modulo 16). This assumes low-order interleaving of the modules. Correspondingly, states i and i+8 must be combined in the Markov chain which describes program behavior. It is necessary to establish whether the resulting lumped chain has the Markov property. If so, this permits the derivation of the model of program behavior for M = 8 from that for M = 16.

In the general case one has M ($\geq 2$) modules (M being a power of 2) and the associated SFMM parameters $p_0(M)$ through $p_{M-1}(M)$. A partition $A = \{A_0, \ldots, A_{m-1}\}$ is defined upon the state space of the SFMM whereby state $i$ is mapped into $A_j$ where $j = i \bmod m$ and $m$ is assumed to be a power of 2. The Markov chain with state space $\{0, \ldots, M-1\}$ is said to be lumpable with respect to the partition A if each equivalence class corresponds to a state in a Markov chain with a reduced state space. It can be shown [43] that a Markov chain is lumpable with respect to a partition A if and only if for any $x, y \in A_i$,

$$q_{ij} = P[\text{next state} \in A_j \mid \text{current state is } x] = P[\text{next state} \in A_j \mid \text{current state is } y] \quad \text{for all } A_i, A_j \in A.$$

Furthermore, if this condition is satisfied, $q_{ij}$ is the conditional probability of a transition to $A_j$ given that the current state is in $A_i$. It is obvious by inspection of the SFMM transition matrix that this condition is satisfied and that $q_{i,i+j}$ as defined above will be given by

$$q_{i,i+j} = \sum_{k=0}^{r-1} p_{km+j}(M)$$

where $r = M/m$ and $M \geq 2$ is a power of 2. By defining $p_j(m)$ to be equal to $q_{i,i+j}$ one obtains the SFMM parameters for a degree of interleaving of $m$. Hence, the SFMM parameters corresponding to the maximum degree of interleaving of interest can be used to derive the SFMM parameters for any lesser degree of interleaving.
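
The lumping step amounts to summing every m-th parameter. The Python sketch below is illustrative only: the function name `lump_sfmm` and the sample parameter values are assumptions, and the list index is taken to be the module offset, so that `p_big[k]` stands for p_k(M).

```python
def lump_sfmm(p_big, m):
    """Derive the SFMM parameters for m modules from those for len(p_big) modules."""
    big = len(p_big)
    if big % m != 0:
        raise ValueError("m must divide the original number of modules")
    r = big // m
    # p_j(m) is the sum of p_{km+j}(M) over k = 0 .. r-1.
    return [sum(p_big[k * m + j] for k in range(r)) for j in range(m)]


if __name__ == "__main__":
    # Hypothetical parameters for 8-way interleaving (they sum to 1).
    p8 = [0.5, 0.2, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
    print(lump_sfmm(p8, 4))
    print(lump_sfmm(p8, 2))
```

Lumping in two stages (8 to 4, then 4 to 2) gives the same parameters as lumping from 8 to 2 directly, which is a convenient sanity check.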

The reverse procedure for obtaining the parameters for 2M modules from those for M modules does not exist due to an excess of unknowns over equations. However, an examination of Table 4 indicates that it is reasonable to assume that

$$p_i(M) = p_j(M) \quad \text{for } i, j \neq 0 \text{ or } 1,$$

i.e.,

$$p_i(M) = \frac{1 - p_0(M) - p_1(M)}{M-2}, \quad i \neq 0 \text{ or } 1,$$

and that

$$p_i(2M) = \frac{1 - p_0(2M) - p_1(2M)}{2M-2}, \quad i \neq 0 \text{ or } 1.$$

Then, if $p_0(M)$ and $p_1(M)$ are obtained from the parameters for 2M modules one obtains

$$p_0(M) = p_0(2M) + \frac{1 - p_0(2M) - p_1(2M)}{2M-2}$$

$$p_1(M) = p_1(2M) + \frac{1 - p_0(2M) - p_1(2M)}{2M-2}.$$

$p_0(2M)$ and $p_1(2M)$ may be solved for to obtain

$$p_0(2M) = p_0(M) - \frac{1 - p_0(M) - p_1(M)}{2(M-2)}$$

$$p_1(2M) = p_1(M) - \frac{1 - p_0(M) - p_1(M)}{2(M-2)}.$$

Noting that $[1 - p_0(2M) - p_1(2M)] = [1 - p_0(M) - p_1(M)](M-1)/(M-2)$ one may repeat this process to obtain

$$p_0(4M) = p_0(M) - \frac{3}{4} \cdot \frac{1 - p_0(M) - p_1(M)}{(M-2)}$$

$$p_1(4M) = p_1(M) - \frac{3}{4} \cdot \frac{1 - p_0(M) - p_1(M)}{(M-2)}.$$

If one repeats this process indefinitely, one gets

$$\alpha = p_0(\infty) = p_0(M) - \frac{1 - p_0(M) - p_1(M)}{(M-2)}$$

$$\beta = p_1(\infty) = p_1(M) - \frac{1 - p_0(M) - p_1(M)}{(M-2)}.$$

$\alpha$ and $\beta$ may be interpreted as the probabilities of referencing the same word and the sequentially following word respectively. For any M,

$$p_i(M) = \begin{cases} \alpha + \dfrac{1-\alpha-\beta}{M}, & i = 0 \\[2mm] \beta + \dfrac{1-\alpha-\beta}{M}, & i = 1 \\[2mm] \dfrac{1-\alpha-\beta}{M}, & i \neq 0, 1. \end{cases}$$


Thus, the two-parameter model of program behavior assumes that each reference, with probability $\alpha$, references the same module as the last reference and, with probability $\beta$, references the sequentially following module. If neither of these events occurs then the reference is random and equally likely to be to any one of the M modules. Table 9 displays the values of $\alpha$ and $\beta$ as calculated from the SFMM parameters for M = 4, 8 and 16. Table 10 lists the percentage errors incurred by using the first and third pairs of values for $\alpha$ and $\beta$ (M=4 and M=16) to calculate memory bandwidth, the simulation results being used to effect the comparison. The close agreement serves to establish the validity of the two-parameter model. Better results are obtained when $\alpha$ and $\beta$ are estimated via the SFMM parameters for larger values of M.
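
The mapping between the two parameters and the SFMM parameters can be sketched in a few lines of Python. The sketch is illustrative only: the function names are hypothetical, p_0(M) and p_1(M) are assumed to equal alpha and beta respectively plus the uniform share (1 - alpha - beta)/M, and gamma is assumed to be 2(1 - p_0) divided by one minus the sum of the squared parameters.

```python
def sfmm_from_two_params(alpha, beta, m):
    """SFMM parameters p_0(M) .. p_{M-1}(M) implied by alpha and beta (m >= 2)."""
    uniform = (1.0 - alpha - beta) / m  # share of the purely random references
    p = [uniform] * m
    p[0] += alpha  # same module as the last reference
    p[1] += beta   # sequentially following module
    return p


def two_params_from_sfmm(p):
    """Recover (alpha, beta) from SFMM parameters for m = len(p) >= 3 modules."""
    m = len(p)
    spread = (1.0 - p[0] - p[1]) / (m - 2)
    return p[0] - spread, p[1] - spread


def gamma_from_sfmm(p):
    """gamma = 2 (1 - p_0) / (1 - sum of p_j squared)."""
    return 2.0 * (1.0 - p[0]) / (1.0 - sum(x * x for x in p))


if __name__ == "__main__":
    p = sfmm_from_two_params(0.3, 0.4, 8)  # hypothetical alpha and beta
    print(p, two_params_from_sfmm(p), gamma_from_sfmm(p))
```

Setting alpha = beta = 0 gives the uniform parameters of the RIRM, for which gamma evaluates to 2.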


7.  Conclusion 

A study of the memory referencing behavior of four programs indicates that the often used Random Independent Reference Model of program behavior is not valid. The Independent Reference Model and the Localized Reference Model were likewise ruled out as inadequate. The Symmetric First-order Markov Model was proposed as a realistic model of the memory reference string and various approximate but accurate analyses were performed upon it to yield expressions for the memory bandwidth. It was also demonstrated that the model of program behavior could be further simplified to obtain a two-parameter model without sacrificing accuracy appreciably.

Table 5. Percentage errors of various analytical solutions for the RIRM case with M = N.

M (=N)

Simulation results (Sethi and Deo)

1.374  1.844  2.347  2.844  3.336  3.843  4.317

error of DGF  0

Table 6. Accuracy of the DGF solution for the LRM case